Reply to Tan et al.: Differences between real and simulated proteins in multiple sequence alignments.

Authors

  • Kieran Boyce
  • Fabian Sievers
  • Desmond G Higgins
Abstract

Tan et al. (1) comment on our earlier paper regarding the accuracy of multiple sequence alignments (MSAs) using different guide tree topologies (2). We stress that the scope of our result was confined to alignments of very large numbers of protein sequences with known structures, where accuracy was measured against structure-based alignments. We point out that this result could not be translated to a strictly phylogenetic view.

Tan et al. (1) demonstrate that, using a phylogenetic perspective, one can get the opposite result to ours. Given how they configure their test system, Tan et al.'s result is to be expected and easy to explain. If one simulates MSAs with many indels at random locations and then tests correspondence between alignments, including gaps in the test, then guide tree topology must have a huge effect. This is more or less inevitable.

Our benchmark test sets do not have gaps at random locations. Gaps are mostly confined to loops between the main secondary structure regions. During evolution, indels may occur in secondary structure elements, but only rarely, and those that do occur may be cancelled out by compensating events that restore the length and periodicity of the element. In contrast, gaps in loops are common and tolerated during evolution. This extreme imbalance in indel frequency has been well known for decades (3).

The parameterization for the ALF simulation program comes partly from ref. 4, which describes such an imbalance, with indels predominantly at exposed positions in structures. ALF can be used to simulate alignments with per-site indel probabilities drawn from a distribution. Tan et al. (1) chose a uniform distribution. One has to ask what kind of real sequences these simulated ones most resemble. What kinds of biological sequences allow indels equally easily at any position? Such sequences may exist in intergenic regions but will be difficult to align after even moderate sequence divergence. Equal probabilities of indels at all sites suggest sequences not under any selective or structural constraint.

All our sequences are proteins with 3D structural information and constrained structure. Our main tests used a combination of PFAM sequences and Homstrad structure-based alignments. We also used Balibase but only to make a minor point. With the large tests, the effect we described was mainly clear for more than 1,000 sequences; that is the upper limit of the tests in Tan et al. (1). On a much smaller scale, we can see …
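
The contrast drawn in the abstract between indels placed uniformly and indels concentrated in loops can be illustrated with a small sketch. It is not ALF and does not use ALF's interface or parameters. The secondary structure string `SS`, the relative weight `0.05` for helix and strand positions, and all function names are hypothetical choices made only for illustration.

```python
import random

def site_weights(secondary_structure, loop_weight=1.0, element_weight=1.0):
    """Relative indel weight per site: 'L' is a loop, anything else is a helix/strand."""
    return [loop_weight if s == "L" else element_weight for s in secondary_structure]

def sample_indel_sites(weights, n_events, rng):
    """Draw n_events indel positions with probability proportional to the site weights."""
    return rng.choices(range(len(weights)), weights=weights, k=n_events)

def loop_fraction(secondary_structure, sites):
    """Fraction of sampled indel positions that fall in loop ('L') sites."""
    in_loops = sum(1 for i in sites if secondary_structure[i] == "L")
    return in_loops / len(sites)

# Hypothetical toy protein: helix, loop, strand, loop, helix (30 sites, 8 of them loop).
SS = "HHHHHHHHLLLLEEEEEELLLLHHHHHHHH"
rng = random.Random(0)  # fixed seed so repeated runs give the same draws

# Uniform model: every site is equally likely to receive an indel.
uniform_sites = sample_indel_sites(site_weights(SS), 10_000, rng)

# Structure-aware model (assumed weights): indels in helices/strands are 20x rarer.
biased_sites = sample_indel_sites(site_weights(SS, element_weight=0.05), 10_000, rng)

uniform_frac = loop_fraction(SS, uniform_sites)
biased_frac = loop_fraction(SS, biased_sites)
print(f"uniform model: {uniform_frac:.2f} of indels in loops")
print(f"biased model:  {biased_frac:.2f} of indels in loops")
```

Under the uniform model the share of indels in loops simply tracks the share of loop sites, whereas the weighted model places most indels in loops, which is closer to the pattern described for the structure-based benchmarks.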

Similar resources

Influence of conservation on calculations of amino acid covariance in multiple sequence alignments.

It has long been argued that algorithms that find correlated mutations in multiple sequence alignments can be used to find structurally or functionally important residues in proteins. We examined the properties of four different methods for detecting these correlated mutations. On both simple, artificial alignments and real alignments from the Pfam database, we found a surprising lack of agreem...

Confidence in comparative genomics.

Comparative sequence analysis has become a widespread approach for identifying and characterizing functional elements encoded within genomic sequences. Marked by early successes (for review, see Hardison 2000), a tremendous amount of sequencing capacity has been, and continues to be, utilized for sequencing genomes of related species. Indeed, the choice of genomes selected for sequencing has le...

Evolutionary HMMs: a Bayesian approach to multiple alignment

MOTIVATION: We review proposed syntheses of probabilistic sequence alignment, profiling and phylogeny. We develop a multiple alignment algorithm for Bayesian inference in the links model proposed by Thorne et al. (1991, J. Mol. Evol., 33, 114-124). The algorithm, described in detail in Section 3, samples from and/or maximizes the posterior distribution over multiple alignments for any number of ...

Probalign: Multiple sequence alignment using partition function posterior probabilities

Motivation: The maximum expected accuracy optimization criterion for multiple sequence alignment uses pairwise posterior probabilities of residues to align sequences. The partition function methodology is one way of estimating these probabilities. Here, we combine these two ideas for the first time to construct maximal expected accuracy sequence alignments. Results: We bridge the two techniques...

Extracting multiple structural alignments from pairwise alignments: a comparison of a rigorous and a heuristic approach

MOTIVATION: Multiple structural alignments (MSTAs) provide position-specific information on the sequence variability allowed by protein folds. This information can be exploited to better understand the evolution of proteins and the physical chemistry of polypeptide folding. Most MSTA methods rely on a pre-computed library of pairwise alignments. This library will in general contain conflicting r...

Journal:
  • Proceedings of the National Academy of Sciences of the United States of America

Volume 112, Issue 2

Pages: -

Publication date: 2015